(feat): CUDA arr_wrappers — Zero-Alloc CuArray Reuse via setfield! #29
Replace the N-way round-robin cache with the `arr_wrappers[N][slot]` pattern (CPU parity). Key changes:
- `CuTypedPool`: `views`/`view_dims`/`next_way` → `arr_wrappers`
- `_resize_to_fit!`: capacity-aware resize (superset of `_resize_without_shrink!`)
- `_cuda_claim_slot!`: maxsize-based capacity check avoids spurious GPU realloc
- `get_array!`: `DataRef` identity check (`cu.data.rc !== vec.data.rc`) for a zero-overhead common path; refcount update only on the rare grow-beyond-capacity case
- `_reshape_impl!` for `CuArray`: same-N `setfield!`, different-N cached wrapper
- Safety invalidation updated for `arr_wrappers` + `_resize_to_fit!`
- Remove the `CACHE_WAYS` constant and the `Preferences` dependency
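The `get_array!` identity-check hot path described above can be sketched as follows. This is a hedged reconstruction, not the actual implementation: the `pool.vectors` field, the `get_array_sketch!` name, and the refcount handling are assumptions; only the `arr_wrappers` layout, the `cu.data.rc !== vec.data.rc` check, and the `setfield!(:dims)` update come from the PR.

```julia
# Hypothetical sketch (field names assumed). For slot `slot` and target
# dimensionality N, reuse the cached CuArray{T,N} wrapper in place.
function get_array_sketch!(pool, slot::Int, dims::NTuple{N,Int}) where {N}
    vec = pool.vectors[slot]            # backing CuVector{T} (assumed field name)
    wrappers = pool.arr_wrappers[N]     # per-N cache: one wrapper per slot
    cu = wrappers[slot]::CuArray{eltype(vec),N}
    if cu.data.rc !== vec.data.rc
        # Rare path: buffer was reallocated (grow beyond capacity) —
        # rebind the wrapper to the new DataRef. Refcount bookkeeping
        # is elided in this sketch.
        setfield!(cu, :data, vec.data)
    end
    setfield!(cu, :dims, dims)          # hot path: zero-allocation dims update
    return cu
end
```

The design point is that the common case touches only `:dims`; the `DataRef` identity check keeps refcount manipulation off the hot path entirely.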
`_cuda_claim_slot!` only checked maxsize-based capacity but didn't restore the logical length of backing vectors after safety invalidation (which sets dims to `(0,)`). This caused escape detection (Level 3) to fail, because vectors appeared empty during overlap checks. Replace the manual capacity check with `_resize_to_fit!`, which handles all cases: capacity growth, length restoration, and the no-op hot path.
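The three cases `_resize_to_fit!` must handle could look roughly like this. A minimal sketch under assumptions: the function name matches the PR, but the body and the use of the `maxsize` field in bytes are inferred from the description, not copied from the source.

```julia
# Hedged sketch of the capacity-aware resize. Three cases:
# no-op, within-capacity via setfield!, genuine realloc via resize!.
function _resize_to_fit_sketch!(vec::CuVector, n::Int)
    if length(vec) == n
        return vec                      # hot path: nothing to do
    elseif n * sizeof(eltype(vec)) <= vec.maxsize
        # Within allocated capacity: restore the logical length with no GPU
        # reallocation. This covers shrink, grow-back, and the critical
        # re-acquire-after-safety-invalidation case (dims were (0,)).
        setfield!(vec, :dims, (n,))
        return vec
    else
        return resize!(vec, n)          # beyond capacity: real GPU realloc
    end
end
```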
CUDA `_acquire_impl!` and `_unsafe_acquire_impl!` now route directly to `get_array!`, eliminating the `get_view!` → `get_array!` indirection. On CUDA, the view/array distinction is meaningless (both return `CuArray`), so all acquire paths converge on the same `arr_wrappers`-based `get_array!`. `get_view!` stubs are kept for backward compatibility (direct callers) but are no longer on the main acquire path.

Add 8 mixed-N pattern tests (1D+2D+3D) verifying zero-alloc for both GPU (`CUDA.@allocated`) and CPU (`@allocated`) across same-slot different-N, multi-slot mixed-N, varying dims, and `unsafe_acquire!` variants.
- Add `_zero_dims_tuple(N)` helper: literal tuples for N≤4 avoid the `ntuple(_ -> 0, N)` dynamic-dispatch allocation on safety invalidation
- Apply to `TypedPool`, `BitTypedPool` (CPU), `CuTypedPool`, legacy `BitTypedPool`
- Fix the `_unsafe_acquire_impl!` `NTuple` overload: delegate to `Vararg` (matches `_acquire_impl!` and the CPU pattern)
- Add CUDA `reshape!` zero-alloc tests (cross-dim, same-dim, mixed, correctness)
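The helper's shape might look like the sketch below. Only the name `_zero_dims_tuple`, the N≤4 literal-tuple strategy, and the `ntuple` fallback come from the commit message; the exact body is an assumption.

```julia
# Sketch: ntuple(_ -> 0, N) with a runtime N can allocate through dynamic
# dispatch; branching to literal tuples for the common N <= 4 keeps
# safety invalidation allocation-free.
@inline function _zero_dims_tuple(N::Int)
    N == 1 && return (0,)
    N == 2 && return (0, 0)
    N == 3 && return (0, 0, 0)
    N == 4 && return (0, 0, 0, 0)
    return ntuple(_ -> 0, N)    # rare high-N fallback, may allocate
end
```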
Pull request overview
This PR updates the CUDA backend to reuse `CuArray{T,N}` wrappers via an `arr_wrappers` cache and `setfield!`, removing the prior fixed-size N-way cache and aiming for zero CPU allocations across unlimited same-N dimension patterns.
Changes:
- Replace CUDA's N-way view cache with `arr_wrappers`-based wrapper reuse (`setfield!` on `:dims`, plus `DataRef` update on rare buffer changes).
- Introduce `_resize_to_fit!` and capacity-based slot claiming to avoid unnecessary GPU reallocations (especially after safety invalidation).
- Extend/refresh CUDA tests to validate zero-alloc behavior for same-N, mixed-N, and `reshape!`, plus update safety invalidation expectations.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `ext/AdaptiveArrayPoolsCUDAExt/types.jl` | Updates the `CuTypedPool` struct to add `arr_wrappers` and remove N-way cache fields. |
| `ext/AdaptiveArrayPoolsCUDAExt/acquire.jl` | Implements `_resize_to_fit!`, capacity-based slot claim, `arr_wrappers` lookup/store, and CUDA `_reshape_impl!`. |
| `ext/AdaptiveArrayPoolsCUDAExt/debug.jl` | Updates CUDA safety invalidation to shrink via `_resize_to_fit!` and invalidate `arr_wrappers` by zeroing dims. |
| `ext/AdaptiveArrayPoolsCUDAExt/state.jl` | Updates `empty!` to clear `arr_wrappers` instead of N-way cache vectors. |
| `ext/AdaptiveArrayPoolsCUDAExt/AdaptiveArrayPoolsCUDAExt.jl` | Removes the Preferences-based `CACHE_WAYS` config and reframes the extension as `arr_wrappers`-based. |
| `src/state.jl` | Adds the `_zero_dims_tuple` helper and uses it during wrapper invalidation. |
| `src/legacy/state.jl` | Updates legacy `BitArray` invalidation to call `_zero_dims_tuple`. |
| `test/cuda/test_nway_cache.jl` | Renames/expands tests from N-way cache to `arr_wrappers`; adds mixed-N and `reshape!` coverage. |
| `test/cuda/test_extension.jl` | Updates struct-field assertions to expect `:arr_wrappers`. |
| `test/cuda/test_cuda_safety.jl` | Updates safety invalidation assertions to check that `arr_wrappers` dims are zeroed. |
| `test/cuda/test_allocation.jl` | Updates resize tests to `_resize_to_fit!` and adds grow-within-capacity-after-invalidation coverage. |
Codecov Report

❌ Patch coverage is 53.33%.

❌ Your patch status has failed because the patch coverage (53.33%) is below the target coverage (95.00%). You can increase the patch coverage or adjust the target coverage.

```
@@            Coverage Diff             @@
##           master      #29      +/-   ##
==========================================
- Coverage   96.79%   96.54%   -0.26%
==========================================
  Files          14       14
  Lines        2620     2632      +12
==========================================
+ Hits         2536     2541       +5
- Misses         84       91       +7
==========================================
```
Summary
Replaces the N-way view cache with `arr_wrappers` + `setfield!`-based `CuArray` reuse — the same zero-allocation strategy already used on CPU (Julia 1.11+). This eliminates the 4-way cache eviction limit: unlimited dimension patterns per N are now zero-alloc.

Key Changes

New: `arr_wrappers` Cache (`types.jl`, `acquire.jl`)
- `CuTypedPool` gains `arr_wrappers::Vector{Union{Nothing, Vector{Any}}}` (indexed by dimensionality N, per-slot cached `CuArray{T,N}`)
- Same-N re-acquire: `setfield!(wrapper, :dims, dims)` — zero allocation
- `DataRef` identity check (`wrapper.data.rc !== vec.data.rc`, ~2ns) minimizes refcount overhead — only updates `:data` when the GPU buffer actually changed (rare grow-beyond-capacity case)
- Removes the `CACHE_WAYS`, `views`, `view_dims`, `next_way` fields and the `Preferences` dependency

New: `_resize_to_fit!` (`acquire.jl`)
- `setfield!(:dims)` when within `maxsize`; delegates to `resize!` only when beyond capacity
- Superset of `_resize_without_shrink!` — also optimizes grow-within-capacity (critical for re-acquire after safety invalidation)

New: Direct `_acquire_impl!` Dispatch (`acquire.jl`)
- `_acquire_impl!` / `_unsafe_acquire_impl!` route directly to `get_array!` (no `get_view!` → `get_array!` indirection)
- `get_view!` retained for backward compat only

New: `_reshape_impl!` for CuArray (`acquire.jl`)
- Same-N: `setfield!(:dims)`, no pool interaction
- Different-N: cached `CuArray{T,N}` wrapper

Fix: Safety Invalidation (`debug.jl`, `state.jl`)
- `_invalidate_released_slots!` for `CuTypedPool`: poison fill + `_resize_to_fit!(vec, 0)` + `arr_wrappers` dims zeroing
- Restores logical vector length in `_cuda_claim_slot!` after invalidation
- `_zero_dims_tuple(N)`: literal tuples for N≤4, avoids `ntuple` dynamic-dispatch allocation

Test Coverage
- Mixed-N patterns (1D+2D+3D): same-slot different-N, multi-slot mixed-N, varying dims
- `reshape!` zero-alloc (cross-dim, same-dim, mixed sequence)
- `_resize_to_fit!` unit tests (shrink, grow-back, after-invalidation, beyond-capacity)
- `acquire!` and `unsafe_acquire!` both paths

Breaking Changes

None. Public API is unchanged. Internal `CuTypedPool` struct fields changed (N-way cache fields removed, `arr_wrappers` added).
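The same-N vs. different-N `reshape!` split described under Key Changes can be sketched as below. This is a hypothetical reconstruction: the function signature, the `slot` argument, and the wrapper retargeting are assumptions; the PR only confirms same-N `setfield!(:dims)` and a different-N cached `CuArray{T,N}` wrapper.

```julia
# Hedged sketch: same-N reshapes mutate dims in place with no pool
# interaction; cross-N reshapes fetch the slot's cached wrapper of the
# target dimensionality from arr_wrappers and retarget its dims.
function _reshape_sketch!(pool, slot::Int, cu::CuArray{T,N},
                          newdims::NTuple{M,Int}) where {T,N,M}
    @assert prod(newdims) == length(cu) "reshape! must preserve length"
    if N == M
        setfield!(cu, :dims, newdims)     # same-N: zero-alloc fast path
        return cu
    end
    # different-N: reuse the cached CuArray{T,M} wrapper for this slot
    other = pool.arr_wrappers[M][slot]::CuArray{T,M}
    setfield!(other, :dims, newdims)
    return other
end
```

In this reading, both branches stay allocation-free because the `CuArray` wrapper objects themselves are never recreated, only their `dims` fields are rewritten.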